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Abstract 

This work considers tree algorithms for the N-body problem where 
the number of particles is on the order of a million. The main concern 
of this work is the organization and performance of these computations 
on parallel computers. 

The work introduces a formulation of the N-body problem as a 
set of recursive equations based on a few elementary functions. It is 
shown that both the algorithm of Barnes-Hut and that of Greengard- 
Rokhlin satisfy these equations using different elementary functions. 
The recursive formulation leads directly to a computational structure 
in the form of a pyramid-like graph, where each vertex is a process, 
and each arc a communication link. 

The pyramid is mapped to three different processor configurations: 
(1) A pyramid of processors corresponding to the processes pyramid 
graph; (2) An hypercube of processors, e.g., a connection-machine like 
architecture. (3) A rather small array, e.g., 2 x 2 x 2, of processors 
faster then the ones consider in (1) and (2) above. 

The main conclusion is that simulations of this size can be per- 
formed on any of the three architectures in reasonable time. 24 sec- 
onds per time step is the estimate for a million equally distributed 
particles using the Greengard-Rokhlin algorithm on the CM-2 con- 
nection machine. The smaller array of processors is quite competitive 
in performance. 

Keywords: N-body problem, particle simulation, tree algorithms 
for particle simulation, parallel computing. 
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1 Introduction 

The study of physical systems by particle simulation is called the "many- 
body" or the "N-body" problem. Such studies are conducted in celestial 
mechanics, plasma physics, fluid mechanics as well as in semiconductor device 
simulation [1]. The simulation finds the trajectories of each particle over some 
time interval, given the initial position, the initial velocity, the external force 
and the nature of the forces that the particles exert on each other. 

The number of particles used for such studies is quite large. It is es- 
timated that to get insight into three-dimensional turbulent flow about 1 
million particles are needed [2]. Thus, such simulations require intensive and 
prolonged computations and the question of efficiency is of great interest. 

This work considers a family of algorithms for calculating the force or the 
potential $ which can be called "tree algorithms". These algorithms reduce 
the asymptotic computational complexity from 0(N 2 ) in the naive approach, 
where each particle interacts with each other particle, to 0(N log N) in [5] 
[6], and down to O(N) in [7] (for particles in two dimensions) and in [9] and 
[11] (for particles in three dimensions). The main goal of this work is the 
organization and performance of such algorithms on parallel computers. In 
particular, it considers how such computers should be organized in order to 
handle tree algorithms efficiently when the number of particles is very large, 
and how time is divided between computation time and time for communi- 
cation between processors. 

The work introduces a formulation of the many-body problem as a set of 
recursive equations based on a few elementary functions. It is shown that the 
algorithms of both [6] and [7] satisfy these equations using different elemen- 
tary functions. The recursive formulation leads directly to a computational 
structure in the form of a pyramid-like graph, where each vertex is a process, 
and to a SIMD computational algorithm. 

The pyramid is mapped to three different processor configurations: 

• A pyramid of processors corresponding to the processes pyramid graph, 
where each vertex is assigned a process and each arc — a communica- 
tion "wire". 

• An hypercube of processors ,i.e., a connection-machine like architecture 

[12]. 



• A smaller array of larger processors , e.g., 2x2x2 processors. 

Computation and communication time are computed for the first two 
cases and estimates are derived for the third case. 

The main qualitative conclusion is that large simulations (about 1M par- 
ticle) can be performed on any of the above architectures in reasonable time. 
This is particularly interesting with respect to the connection machine - an 
existing, ready to be used system - but also noteworthy with respect to the 
array of processors which is quite competitive in performance. 



2 A Recursive Formulation of the N-body Problem 

We consider N point particles distributed in two or three dimensional space 
(2D or 3D). We are interested in the forces that these particles exert on each 
other. We use celestial mechanics terminology (point masses and Newtonian 
gravitational fields) although the method is equally applicable to electrostat- 
ics (electrical charges and electrostatic fields) and fluid mechanics (vortex 
particle and vortex field) [1]. 

The force that a particle at a point x, x € .ft 3 , exerts on a particle y, 
y 6 ft 3 , is given by 

_ M x M y 
hv ~ ^Tj fTT^r (1) 

ll*-y II 

where 1 P is a unit vector in the x - y direction. The forces are additive, i.e., 
the total force on any particle p, F p , is the sum of the forces due to all other 
particles 

f p = £ /* (2) 

q£particle3,q^p 

An additional valuable property of these forces is the symmetry, f pq = 
-f qp . It is also convenient to express the forces in terms of a potential 
function, $, such that the force on a unit mass at x is 

F x = -grad $ (3) 

The tree algorithm proceeds by dividing the space into "computational 
boxes". The method is illustrated by its 2D version which is simpler to 
explain; the generalization to 3D is straightforward. 

We choose a (2D) rectangle such that all the particles are included in it. 
This is called the top rectangle. It is partitioned into four equal rectangles, 
called the children of the top rectangle. Top is called the parent of the 
children rectangles. Each child is partitioned to four equal rectangles. This 
process continues to an arbitrary level, say h. For compatibility with prior 
work [11], the level of top is named the - level, so the bottom level becomes 
the h-th. level. 

The set of neighbors of a computational box b comprises all boxes of 
the same size whose boundary has at least one point in common with the 



boundary of b. Thus, in 2D a box has at most eight neighbors; in 3D a box 
has 26 neighbors. 
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Figure 1: The computational box, its parent, children and neighbors; Aa,Ab, 
Ac,Ad are children of A; A is a parent of Aa,etc.;The neighbors of Ad are: 
Aa,Ab, Ba, Bb,Da,Cc,Ca, and Ac. 



It is convenient to partition the force, or the potential, into the far field 
and the near Held so that $ p = $ Pynear}ield + $ p ,far field- The reasons for this 
partition are twofold: (1) Approximation formulas valid for the far field po- 
tential are not valid for the near field potential; (2) Considerations related to 
the finite computer word size and to global accuracy necessitate modification 
of the near field at small distances (smoothing [3] [4]). In partitioning to far 
and near fields we follow [7,9] by taking the field at computational box due 
to particles in itself and in its neighbor boxes to be the near-field potential; 
the field in the box due to all boxes which are not neighbors is the far-field 
potential. 

Let $ n be the far-field potential function due to particles in the box n. 
For any x,x gn and x g neighbors(n), $ n (x) is the contribution of particles 
in n to the potential at x. 

Let $ n be the far-field potential in n; i.e., for any x, x €n, $ n (x) is the 



contribution of all particles q, q & n, and q $■ neighbors(n) to the potential 
at the point x. 

A recursive relation among $ n and $ parent ( n) is the heart of the tree 
methods: 

* n = <*Went(n) + £ $, (4) 

t€/n 

Where /„, the set on which the summation of $, is done, is called the 
interaction set of n and is defined by 

h = {j\j € children(neighbors(parent(n))),j £ neighbors(n)} (5) 
A solution of 4 is defined by its boundary conditions 

* top = 0, (6) 

and by 

$n= £ *i (7) 

t'6cAi/<iren(n) 

where n is not a bottom level rectangular; for a bottom level n 

*n = $°, (8) 

where $° is the far-field potential function given in terms of the particles 



in n. 



3 The Barnes-Hut and the Greengard-Rokhlin Meth- 
ods 

The Barnes- Hut (BH in the sequel) [6] and the Greengard-Rokhlin (GR) [7,9] 
methods both solve the above recursive equations. Their differences are in 
how the potential functions are represented and approximated. 

For presenting the material in this section it is useful to distinguish be- 
tween a function, say, /, its approximation, / and the representation of this 
function or its approximation, rep(f) or rep(f). Usually, we want to repre- 
sent the function itself, however, for various reasons we carry only enough 



information to define the approximation completely. Thus, by design, rep(f) 
represents /. 

In the BH method the function $£ is represented by two quantities (m,c) 
where m is the sum of the masses of the particles in the box n, and c is the 
center of mass vector. Thus, $£ at a point y, y g neighbors(i) and y i, is 
approximated by $£ which is the potential of a mass m at c evaluated at the 
point y. I.e. 

♦iw = GjJZ^ (») 

Similarly, $ n , for level I, I ^ h, approximates $ n . It is calculated by 
summing the masses of the children and finding a center of mass of the 
masses of the children. The representation of $ n is 

rep(*n) = ( £ m , ^6c^r«n(n) m,C- ) ^ 

i€children(n) 2^t'6c/i«Wren(n) rn i 

The approximation to # n , tf n , is derived from 4; * n is represented as a list 
of pairs; the pairs (m,-, cj) of all the members of interactive set are appended 
to form a list which, in turn, is appended with a similar list representation 

^porent(n)- 

The actual potential, or the force at a point p, is calculated by evaluating 
the appropriate #„, n at level h, by evaluating each member of the list at p, 
and by adding the near-field effect to the result. 

The 3D version of the GR method is described in [9] and in [11]. The 
much simpler 2D version suffices for showing how the algorithm fits the above 
computational structure. 

In the sequel we identify R 2 with the complex plane C which simplifies 
the notations. Let n be a box on some level. For all z, z £ n and z £ 
neighbor s(n), the GR method considers $ n as its multipole expansion around 
the center of n 



oo 

*»(*) = aolog(*) + E^- (11) 



For all z, z e n, the GR method considers $ n as its local (Taylor) expan- 
sion around the center of n, 



*n(z) = E b >* 1 - (12) 

/=0 

In both cases the functions axe approximated and represented by their 
first p + 1 coefficients, where p is chosen according to the accuracy required. 

To derive $ n from the $'s of its children, equation 7, the child represen- 
tation is "translated", i.e., each child $j, j € children(n) is expanded to a 
multipole expansion around the center of n. Once this is done, the $'s of all 
children are expanded around the same origin and $ n is found by summing 
the coefficients of the translated $_,-. 

To derive <£„, (equation 3) the $,, i € interactive set of n, are converted 
to a local expansion around the center of the box n. ^ , poPent ( n ) is in this local 
form and, therefore, the remaining operation is to translate (shift) the local 
expansion from the center of parent(n) to the center of n. Now, all these 
functions are represented in n by their local expansions around the same 
point, and \P n is obtained by summing up the coefficients of all these local 
expansions. 

Details of the GR algorithm follow: 

Let n be a box of level h (bottom level). Let the center of the box be the 
coordinate system; let the box have m point masses {?,-, i = l,...,m} located 
at points {z,-, i = 1, ...,m} in the box; let r be the diagonal of the box , thus, 
\zi\ < r. Then, for any z 6 C with \z\ > r, 



Qk 



K(z) = aolo g (z) + ^^, 



(13) 



fe=i 



where 



a = ^2 q { and a* = ^ 



-*(*r 



$„ is approximated by the first p terms of the infinite series. It can be shown 
that for any p > 1 , 



<Mz)-a log(z)-£^ < 



fc=i 
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where 



^ = £k'l 



t=l 



Cleaxly, $_, is represented by { a k ,k = 0,...,p } and the coordinates of 
the center of the n-th box. 

To find $,, i not of level h, one uses equation 4. To sum the $_,, ; e 
children(i), the expansion representing $, is "translated"; i.e., each $_, is 
expanded to a multipole expansion around the center of i. Once this is done, 
the $'s of all children are expanded around the same origin and $, is found 
by summing the coefficients of the translated $,. 

Let the coefficients a iy i = 0, ..., oo, specify the multipole expansion of $ n 
around the point z . The translation of this expansion to the origin is given 
by 

*n{z) = aolog(z) + J£^, (14) 

/=i z 
where 



bi 



r + S^' : ■ < 15 > 



with (Jj the binomial coefficients. 

*„ is formed by retaining the first p + 1 terms of the above series. It is 
represented by { b kj k = 0,...,p } and the coordinates of the center of the 
n-th box. 

#„ (equation 3) is computed by converting the $,, i e interactive set of n, 
to a local expansion around the center of the box n. This conversion is given 
by the following: 

Let { a*, k = 0, ... } be the representation of some $_,-, j is not a neighbor of 
n. Let n's center be the origin, and let the center of j be z . The contribution 
of $j to $„ at a point z in n is 

oo 

E &<■*'> (16) 

where 

00 

bo = aolog(-zo) + J2 -nf , (17) 



-z k 

it=i z 
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and 

The shifting of ^porent(n) to the center of n is given by the following 
equation: For any z , z and a*, A: = 0, 1, ...,p, 

£ a k (z - z ) k = £(£ a k (?) (-z ) k - l )z<. (19) 

Several functions can be defined to summarize the GR algorithm: 

Let '+' be an operator that maps representations of the same type which 
are expansions around the same point z into a representation around z each of 
its coefficient is the sum of the appropriate coefficients of the '+"s arguments. 

Translate; maps a representation of a multipole expansion to a multipole 
expansion around the center of box z (equation 14). 

Shift z maps a representation of a local expansion to a local expansion 
around the center of box z (equation 19). 

Convert; maps a representation of a multipole expansion to a local ex- 
pansion around the center of box z (equation 16). 

Under these conventions the computation process by which the GR algo- 
rithm computes equations 4 and 7 can be written in the following way: 

rep($ n ) = J2 Translatenirepfii)) (20) 

iGchildren(n) 



rep(tf n ) = Shift n (rep(V parent(n) ))+ £ Convert n (rep{$i)) 

i€{ili€cAtldren(n«i0j»6or«(parent(n))), 
j £ncigh.bors(n) } 

(21) 
The operations of the BH algorithm can be described with functions bear- 
ing the same names with one minor difference: Translate; has all the repre- 
sentations of the children of z as arguments and the summation is omitted. 
I.e., equation (20) becomes 

rep($ n ) = Translate n (rep($i),rep($2), •••, re P($t')5 •••) a-U h » € children(n) 

(22) 



11 



4 The Computational Structure 

This section considers a network of processes, each associated with a region of 
space (a cell, a rectangular for 2D, a box for 3D). Each process communicates 
with its parents, its children and its neighbors to compute the solution to the 
N-body problem according to the above algorithm. If a processor is assigned 
to each process and all processors operate in a SIMD style, this architecture 
is "connection machine like" [12] in the sense that there are many equal 
processors operating in SIMD style which communicate via many connections 
to "neighbors". The difference between the connection machine and this 
network is that this network is not a hypercube; instead, its connections are 
derived from the structure of the equations. Our purpose is to find bounds 
on the computation time of this network architecture in order to compare it 
with the connection machine and other architectures. 

A look at the equations shows us the following communication paths for 
each process (cell) C: 

a. C to C's parents (equation 8), 

b. C's parents to C (equation 4), 

c. C to each of its children (equation 4), 

d. Each child of C to C (equation 8), 

e. C to C's interactive set (equation 4), 

f. Each member of C's interactive set to C (equation 4), 

h. C to its neighbors (h level only for the calculation of the near-field), 

i. Each neighbor of C to C (h level only for the calculation of the near- 
field). 

If we abstain from sending messages in both direction at the same time 
(as long as the SIMD discipline is maintained this does not occur) a and b, 
c and d, and h and i can be satisfied by the same "wire". Moreover, the 
interactive set is symmetric, i.e. for any box a, b, a € interactive set(b) 
implies b € interactive set(a). Therefore, e and f can also be united. 

Thus, we need the following connections at each process C: 

12 



a. C to/from C's parents (8), 

b. C to/from each of its children (4), 

c. C to/from C's interactive set (4), 

d. C to/from its neighbors (h level only for the calculation of the near- 
field). 

If each connection is implemented, connection machine style, by one wire, 
we have 32 wires in 2D (1 parent, 4 children, up to 27 members in interactive 
set). In 3D we have 198 wires (1 parent, 8 children, up to 189 members in 
interactive set). This number seems high; certainly it cannot be implemented 
as one physical wire pef connection. However, we can reduce the number of 
connections as follows. 

The information sent from C to its interactive set members is the same 
as that sent to its parents. C's interactive set members can be reached by 
having C's parents forward the information to their (26) neighbors, who,in 
turn, forward it to their children. Since each message carries its source, the 
children can check if the sender, C, is in the child's interaction set and discard 
the information on a negative answer. 

This argument replaces connection to interactive set members with con- 
nections to neighbors. The number of connections per cell becomes 13 for 
2D and 35 (1 parent, 8 children, 26 neighbors) for 3D, which, although not 
low, is much better. In such an arrangement there are connections to par- 
ents, to children and to neighbors for all processors, including those at the h 
level. Figure 2 illustrates the connection structure in 2D, the equivalent 3D 
structure is somewhat difficult to draw. 

Figure 3 shows that the computation structure has the form of a graph 
where each node describes a cell and each branch describes a wire connecting 
two cells. The graph has a form of a pyramid with the top node corresponding 
to the top cell and the bottom nodes to the h level cells. Levels, or cells of 
equal size, appear distinctly in the graph; they are nodes of equal height from 
the bottom or the top of the pyramid. 

Consider an algorithm that performs the far-field computation in the 
following steps: 
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Figure 2: A 2D cell (node) and its connections 

Algorithm: 

1. rep($°): Compute rep($°), for all n, n in level /i. (The exact represen- 
tation and method of computation depends on the method used and 
will be elaborated on in the sequel.) 

2. Calculate rep($ n ) going up the tree: Given rep($°), and using equa- 
tions 8 and 20 compute rep($„) for all n, level by level, starting in level 
h — 1 and ending in level 0. 

3. For all nodes of the tree calculate the effect of the interactive set mem- 
bers: For all direction d in a level do 

(a) Each node n sends to its interactive set member that lies in direc- 
tion d from it the rep($ n ) for all p, p a child of n; 

(b) Each node n receiving such an message plays the role of a neighbor 
and retransmits it to all its children; 

(c) Each node n (playing the role of a child) computes Convert z (rep($i)) 
for all messages received from one of its interactive set members 
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Figure 3: The computation "tree". 

and discards the rest of the messages. The result is accumulated, 
awaiting the arrival of rep(^,) from n's parent. 

4. rep{^l n ): The values of rep(^ n ) are propagated down the tree to com- 
pute rep(^ n ) for all children according to equation 21. 

5. Evaluation: For all n in level h, rep(^! n ) is used to calculate the far-field 
potential at each mass-point in the cell. 

Note that all the h level nodes can perform step 1 simultaneously. Simi- 
larly, step 2 is simultaneously performed by all nodes of the same level. Step 
3 is simultaneously performed in parallel by all nodes of the entire network. 
This holds if we understand that a node without children does not send them 
messages, etc. 

Some additional consideration of the algorithm shows that our tree con- 
sists of three kinds of nodes: the top node, the intermediate nodes, and the 
bottom level nodes. Figure 4 illustrates the functions computed and the 
information flow in each node. 

The above algorithm differs from the algorithms described in [9] and 
in [11] in several aspects. First, the algorithm is really a frame in which 
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Figure 4: Computation and Information transfer in tree nodes. 

different functions may be substituted for $ n , Translate z , etc. The algo- 
rithm is described as a "tree" of processes operating in parallel rather than 
a single process algorithm as in [9], or the level- by- level parallel algorithm 
as in [ll](page 50). More importantly, this algorithm handles the informa- 
tion transfer explicitly and so reduces both communication and computation 
times. The main tool for accomplishing this is the parallel structure. Infor- 
mation is sent in parallel so that messages do not conflict and computing is 
done in parallel by all processes of the tree. 

The network of processes can operate also in the "data flow" mode where 
each process performs the computation once the data is available. The op- 
eration of the network in this mode has not been fully investigated as yet. 
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5 Communication Time 

The time required to solve the N-body problem by the above computation 
structure consists of the time that each cell spends computing plus the time 
required to transfer the information from process to process. The later com- 
ponent is named communication time. The communication time is bounded 
under several assumptions. 

1. Time is sampled and divided into intervals; during one time-interval a 
process can both send one message and receive one message. 

2. Messages are of bounded length. 

3. The time to send a message is one "message time" unit. 

When the messages are small, one can add the assumption that processes 
can combine several messages into one message. It turns out that in our 
applications sizes are such that this assumption should be be omitted. 

It is further assumed that each function representation is transferred by 
one message (this assumption will be modified later on). Under the above 
assumptions we get the following communication time bound (Notice that 
under the SIMD assumptions all processes are doing the same operation at 
the same time.) 

Consider the algorithm for far-field computation in the previous section. 
Let us assume that in 2D there are n x n regions {nxnxn regions for 3D) 
with h = [log n] . Denote by n c hu<tTen the number of children of a processor 
( processor level is not h). Denote by n n the maximum number of neighbors 
of a processor. Denote by n aet the maximum number of members in an 
interactive set. The following lists the communication time in message-time 
units for each step of the algorithm: 

Step 2 Since each processor can send one one message and receive one message 
at a time, the time is n c hiidren x h. 

Step 3 Here each processor sends n c «wren messages to each neighbor which, in 
turn, sends one copy to each of its children. The time is n n (n c hud Ten + 
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Children)- Note that messages can be sent and received simultane- 
ously by all processors without interference provided that each proces- 
sor sends its message in the same "direction"; i.e. all processors send 
to the left and receive from right, etc. 

Step 4 Here each processor sends a message to each of its children. The time 

IS n c hildren X h. 

Thus, the total communication time in message time units is: 
communication time < 2h.n chi i dren + n n x (n cA ,- Wren + n 2 children ). (23) 

In terms of actual time what is "message time"? 

The GR algorithm specifies a function by p coefficients and by the coor- 
dinates of the center of the box around which the function is expanded (the 
source of the message). Since messages are function descriptions, all mes- 
sages are of equal length and the concept of message time is well defined as 
the time required to send a function description. For example if p = 10 and 
each coefficient and the source are represented by 32 bits, the length of the 
message is about 352 bits. The connection machine messages, for example, 
are 190 bits long; therefore, on that machine, two actual messages are needed 
to implement our 352bit message and the "message time" becomes the time 
for two connection machine messages. 

The case of the BH messages is somewhat more complicated. Fixed size 
messages, each consisting of (center of mass, mass, source) travel up the 
tree. Therefore, the contribution of steps 2 and 3 of the algorithm is as 
above, totaling 

h.n children + n n x (n ch u dren + n 2 chi , dTen ). (24) 

However, on the way down, the number of such triplets that pass each 
node of the tree equals the number of interactive set members of all the 
node ancestors. Moreover, this function description is not a message of fixed 
length and the estimation of communication time has to be done differently. 

The number of triplets that a node at the bottom of the tree (pyramid) 
receives is (less then) 

n, et x h. (25) 
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If messages are transmitted starting from the bottom of the tree, the above 
is the number of messages step 4 requires. It is an upper bound since some 
nodes have small interactive sets (e.g, top has zero members). Thus, the total 
number of messages in the BH algorithm is: 

h X n c hildren + "n X ("c/uMren + «c«Wren) + n set X h. (26) 

With k triplets per message, the communication time is the above expression 
over k. 

The following argument supports the above expression: 
Choose an arbitrary node, say A. A sends down the tree the triplets of 
its interactive set. As soon as the last triplet clears A's child, the child can 
send its own interactive set triplet down the tree. Thus, for a tree of height 
h and a bound on the number of members in an interactive set, we get the 
above expression as a bound on the communication time. 
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6 Computation Time 

This section estimates the computation time resulting from the arithmetic 
operation only. It does not include the communication time and it does not 
include data transfers inside processors. The estimation is made for both 2D 
and 3D using both GR and BH algorithms. 

The time to perform floating point addition, multiplication, etc. is de- 
noted by t+,t m , etc. t+, etc. denotes the corresponding time to perform the 
corresponding complex arithmetic. It is assumed that i+ = 2i+, and that 
L = 4U + 2t + . 

6.1 GR Algorithm in 2D 

It is assumed that we maintain p + 1 terms in each expansion. 

6.1.1 $° 

This calculation follows Theorem 2.1.1 in [9]. 

It is assumed that in each cell there are m mass points $ at points Zi,i = 
l...m. 

1. The calculation of zf , i = 1, ...m, k = 1, ...,p, requires m(p - l)i t . 

2. The calculation of a^ requires: 

For k = the time is (m - l)t+; For k ^ the time is (m-l)i+ + 2mi m ; 

3. Summing the above and replacing the complex arithmetic entries by 
their equivalent floating point arithmetic: 

(m-l)< + +2(m-l)pt + +m(3p-l)i\) = (8mp-2p-m-l)t + +m(12p-4)U 

(27) 

6.1.2 Translate z 

This calculation follows Lemma 2.2.1 in [9], equation 15 above. Notice that 
at each level there are only four different values for z . These differ from 
each other by multiplication by j (the complex rotation by |). As one goes 
up a level the corresponding z Q is multiplied by 2. Thus, the values of z l , 
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/ = l,...,p can be assumed to be precalculated., and 15 takes the following 
format: 

k _ _ a^nf + ^ t(r2 >-^ Zo) .->g-_ij. m 

where level denotes the cell level, and m = 0, 1,2, 3. 

1. The calculation of b\ requires: 

For / = — no operations required; For / ^ the time is 3/t, + li+. 

2. Summing the above and replacing the complex arithmetic entries by 
their equivalent floating point arithmetic: 

3 Pl + *P L + (£+f) f+ = 4( P 2 + p)t + + 6(p 2 + 3p)t.. (29) 

6.1.3 Convertg 

This calculation follows Lemma 2.2.2 in [9], equation 18 above. Notice that 
here , as in the above section, at each level there are only several different 
values for z ; each corresponds to the position of a cell relative to each of its 
interactive set members. Here too, as one goes up a level the corresponding 
z is multiplied by 2. Thus, the values of z l , 1 = 1, ...,p can be assumed to be 
precalculated; the number of sets of numbers equals the maximum number 
of members in an interactive set. With z l Q assumed known constants, and 
taking into account the multiplications by powers of 2, we get the following 
operation counts: 

1. The calculation of bi requires: 

For / = — (2 + 2p)i m + pi+. For / ^ — (2 + 2p)i. + pi + . 

2. Summing the above and replacing the complex arithmetic entries by 
their equivalent floating point arithmetic: 

(2+4p+p 2 )*; + (p+l)pt + = 4(2+4p+p 2 )t.+(2.(2+4p+p 2 ) + (p+l)p)i + 

(30) 
= (8 + 16p + Ap 2 )U + (4 + 9p + 3p 2 )t+ (31) 
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6.1.4 Shift, 

This calculation follows Lemma 2.2.3 in [9], equation 19 above. As in the 
Translate, case, at each level there are only four different values for z 
differing from each other by multiplication by j (the complex rotation by ^). 
As one goes up a level the corresponding z is multiplied by 2. Thus, again, 
the values of z l , I = 1, ...,p can be assumed to be precalculated. 
From 19 follows that 

c ' = E a *(fc)(-*o)*-'. (32) 

1. The calculation of cj requires: 

3(p -/)*. + (?-/)*+• (33) 

2. Summing the above and replacing the complex arithmetic entries by 
their equivalent floating point arithmetic: 

^ { 3L + i + ) = (Z±±k(l2t. + 8t + ) (34) 

6.2 GR Algorithm in 3D 

The following calculations follow the 3D expansion by F. Zhao [11]. Zhao 
expands the various functions to a series of the derivatives of ^ where R is 
the distance from the center of a cell (whose function is being computed) to 
the point at which the potential is desired. To avoid evaluation of (direc- 
tion) cosines, they have been expressed in terms of the cartesian coordinates 
(x!,X2,x 3 ). In the sequel, p is the order of the highest derivative retained. 
The number of terms in the expansion, denoted by n p , is derived from p. 

6.2.1 $° 

This calculation follows Theorem 3.2.1 in [11]. 

It is assumed that in each cell there are to mass points qi at points 

(xij,x 2 ,-,x 3 ,), i = l...m. 

««* = X>(-1)'' +J+ *7T^44- (35) 
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If we retain only terms of i + j + k < p then the number of terms 



is 



nr = f E '^ = (p±ifc2W (36) 

it=o j=o «=o ° 

which results in nj = 4, n 2 = 10, n 3 = 20, n 4 = 35, etc. 

1. Consider the following recursive relations between the coefficients: 

ao,o,o = 1 (37) 

(-l)x,- (~l)xj (-1)** 

2. There are n p recursive steps each corresponding to a different a,^. 

3. Each recursive step requires Zt m (i is stored as 4). 

4. The above is repeated for m particles and accumulated, giving the 
result (m — 1)£+. 

5. Summing all up results in the following bound 

n p (3mi. + (m - l)i + ) (39) 

6.2.2 Translate z 

This calculation follows from Lemma 3.2.1 in [11]. The coefficients of the 
translated expansion, Cijk, are 

i i fc 

°«i* = EEE a «/87 a !--«j-/j,*-7- ( 4 °) 

at=0 /3=0 7=0 

Here a^ fc are the coefficients in the expansion for 1/R, measured from the 
old cell center with respect to the new center. 

1. The a' iik are constants that can be precomputed. The number of differ- 
ent sets of such constants equals the number of children of a cell — 8 
in 3D. The constants at different levels can be derived by multiplying 
by appropriate powers of 2. 

23 



2. The number of terms in Qjk is (i + l)(j + l)(k + 1). 

3. Each term is a multiplication of an a', a power of 2 and an a,-,*; therefore 
each term requires 2<» + 1+. Thus, each c^* requires (i + l)(j + l)(k + 
l)(2t m +t+). 

4. Summing all up, i + j '• + k < p, results in 

E E "if \i + 1)0" + 1)(* + 1)(2<. + U) = (41) 

Jt=0 j=0 «'=0 

^(p 6 +21p 5 +175p 4 +735p 3 +1624p 2 +1764p+720)(2<.-K+) ^ k„(2U+t+). 

(42) 

The above expression (caculated by F. Zhao using MAXIMA [19])is denoted 
by k p . Thus, the time is 

k p (2t m + 1+). (43) 

6.2.3 Convert z 

This calculation follows from Lemma 3.2.2 in [11]. The coefficients of the 
converted expansion, c,jjt,are 

Cijk = 4UW E ^^b i+a , j+0 , k+ ^(i + a)\(j + /?)!(* + 7 )! (44) 

bijk are the coefficients of the local expansions for 1/R, measured from the 
old cell center, with respect to the new center. 

1. The b^k are constants that can be precomputed. The number of dif- 
ferent sets of such constants in the number of interactive set members 
of a cell — 189 in 3D. The constants of different levels can be derived 
by multiplying by appropriate powers of 2. 

2. Each term of cy* requires at most (2t m + t + ). (There are three items; 
a a 0^, a power of 2, and a constant.) 
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3. To calculate the contribution of all terms note that both ay* and bijk 
are zero if one of their coefficients is larger then p. Thus, the amount 
of time is 

p p-kp-k-j p p-ap—a—0 

EEEEE £(2*.+* + )= (45) 

fc=Oj=0 t'=0 a=kP=j f=i 

= ^(3p 4 + 22/ + 57p 2 + 62p + 24)(2*. + t + ) d ^ Cp(2*. + * + ) (46) 

A crude upper bound can be derived simply in the following way; If the 
vanishing of the 6,j* is not accounted for, the estimate for the number of 
terms of cy* is n p ; Since the number of c^'s is n p , the bound on time is 
rip(2t m + t + ). The ratio between the bound of equation 46 and this last 
bound is about p/n p ; This illustrate the advantage in the result of equation 
46. 

6.2.4 Shift, 

This calculation follows from Lemma 3.2.3 in [11]. The coefficients of the 
converted expansion, dijk,axe 



<*** = a t Q CHaj+pM*, (' + a ) (' Y)( k+ 7 7 ) Ax? Axf Axg. (47) 

Here (Axi,Ax2, AX3) are the coordinates of the old center in terms of the 
new center. 

1. The Ax,- are constants that can be precomputed. The number of dif- 
ferent sets of such constants is the number of children of a cell — 8 in 
3D. The constants of different levels can be derived by multiplying by 
appropriate powers of 2. 

2. Each term of dijk requires at most (3i, -K+) (there are four terms: c,-^, 
a power of 2, the combinations and the A's). 
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3. Summing over ail terms and all coefficients results in the following 
bound 

p p-kp-k-j p p-ap-a—0 

EEEEE E(3'.+< + ) (48) 

k=Oj=0 «=0 a=fc/3=i -y=« 

= 24 (3p 4 + 22p 3 + 57p 2 + 63p + 24)(3t. + 1+) = Cp(3*. + 1+) (49) 

6.2.5 Evaluation 

The evaluation of the potential at each mass point requires the evalu- 
ation of ^ n , n at level h, using equation 



i,j,k=0 

for the potential. The force can be obtained by taking the partial 
derivatives; Thus, 

p 

"xi {% ) = 2^i dijk tx i X 2 X 3' 
i,j,k=0 

The potential can be evaluated with two multiplications and one addi- 
tion per term (keep x\x 3 2 x^ for each ijk and proceed by increasing the 
indices one at a time). For m mass points in z, the evaluation of the 
potential requires 

mn p (t+ + 2t m ). (50) 

For m mass points in z, the evaluation of the force requires about 

m(n p (t + +2Q + 3n p U). (51) 

The above was derived by assuming that first the members of the series 
were calculated with a power smaller by one for each a;,-, and, next, for 
each component of the force, the series components where multiplied 
by the omitted components. (I.e., multiplying by 1x2X3 for F x \ and 
ignoring the operations for forming 2x2X3.) 
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6.3 BH Algorithm in 3D 

The 2D BH algorithm is similar to the 3D version ; its detailed opera- 
tion count is omitted. 

Only the calculation of $° and Translate z , as well as the final evalu- 
ation, have significant operation counts; Convert z and Shift z amount 
to concatenation of lists. 



6.3.1 $° 

The following time is obtain from equation 10 (in 3D): 
(m - l)t+ + ZmU + 3(m - 1)< + + 2t, = 3m<» + 4(m - l)t+ + 3*/. (52) 

6.3.2 Translate z 

Translation is the above calculation with m = 4: 

12*. + 12<+ + Zt/. (53) •. 

6.3.3 Evaluation 

The evaluation of the potential, or force, at each mass point requires the 
evaluation of each member of the list representing ^ n , n at level h, us- 
ing equation ?? for the potential and equation 1 for the force. #„ , n at 
level h, has at most (h— 2)*{maximum number of members in an interactive set) 
elements. That maximum is 189 in 3D. 

The force between two points (equation 1 ) is evaluated according to 

Fxi = G M * M ,«-** (54) 

||*-y|| ll*-y|| 

and requires (3i, + 2t+ + 3£_) + 15*. + 3i_ + 3i/ where the 15 is an 
estimate of the time for evaluating a square root. With t- = £+ the 
result is: 18t, + 8t + + Zt/. 
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7 A Comparison 

In order to gain insight into the above results this section maps the pyramid 
of processes to three different processor configurations: 

• A pyramid of processors corresponding to the processes pyramid graph, 
where each vertex is assigned a process and each arc a communication 
"wire" . 

• An hypercube of processors ,i.e., a connection- machine like architecture 

[12]. 

• A smaller array of larger processors , e.g., 2x2x2 processors. 

The performance of the BH and the GR algorithms are compared by calcu- 
lating their communication and computation times in these configurations 
The calculation is done for several typical values of N, the total number of 
particles, m number of particles in a cell, p and n p that determine the number 
of coefficients that the expansion carries. To get some idea of actual time 
in seconds the speed of communication and floating point operations of the 
CM2 model connection machine is used. 

7.1 The BH Algorithm - Comparative Speed 

The section starts by summarizing results from previous sections. 

messages: It is assumed that the BH algorithm triplet consists of 4 single 
precision 32bit word plus one 32bit word for specifying the origin of the cell. 
Thus, the 160 bit message fits into the 190 bit connection machine message; 
message-time unit becomes the time for one message on the CM2 connection 
machine which is about 0.250 millisecond. 

Communication time: In 3D we have 8 children, 26 neighbors, and 189 
as the maximum size of an interactive set. Thus, the communication time is 
197h + 1872 (equation 26). 

Computation time: To simplify the discussion it is assumed t+ = <„; 
tj = 5t + . Under these simplifying assumptions the time spent in computing 
is: 

1. $n°: (7m + 11)*+ ; 
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2. Translate^. 39ht+; 

3. Evaluation: I89m(h - 2)(41)*+. 

Cells and particles: Let the number of cells in the bottom level,/*, be 
N = n x nxn ; then h = |"log 2 n] . The total number of cells is: 



"<i + 5+s + ™>*!" 



(56) 



Since the number of processors in the connection machine is 64K , we 
would like to restrict the number of cells to this number. 
The number of particles is mN. 

Summary: The following table summarizes communication and compu- 
tation times for selected values of n and m. The communication time is in 
message-time units and the computation time is in multiples of t + . 
n m particles h communication 

message units 
2857 
2857 
2857 
3054 
3054 
3054 

To get some insight into these numbers we recast the above table using 
20Kflops per second for t + (CM2 model connection machine measured time 
[14]), and 250 microseconds per message (CM2 model connection machine 
message time [15]). Time in the table is expressed in milliseconds and is 
rounded. 

communication 

millisec 

714 

714 

714 
763.5 
763.5 
763.5 
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32 


10 


320tf 
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evaluation 


t+ 


u 
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23247 


195 


81 


232470 
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464940 
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619920 
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evaluation 

millisec 

1150 

11500 

23250 

1550 

15500 

31000 



The table illustrates several characteristics of this computation scheme; 
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1. While the communication time is substantial, it does not increase with 
N so long as the height of the tree is constant. Neither, in fact, do 
any of the quantities associated with the "tree" . These quantities are 
a function of the size of the tree, i.e., of n and h, but not the number 
of particles. 

2. With a low number of particles most of the time is communication 
time. 

3. Since the number of processors in the connection machine is fixed, to 
reach a high number of particles m has to increase. This shifts the 
computation load to the evaluation and to the bottom level (h level) of 
processors. They do most of the work while the rest of the processors 
are idle. 

4. The evaluation time could be reduced significantly by increasing the 
number of processors so that m remains 1. This cannot be done in the 
connection machine that has a fixed number of processors. 

7.2 The GR Algorithm - Comparative Speed 

The section starts by summarizing results from previous sections. 

messages: It is assumed that the GR algorithm uses n p coefficients; each 
coefficient is represented by a single precision 32bit word plus one 32bit word 
for specifying the origin of the cell. Thus, the length of a message is 32n p 
bits. A message-time unit is the time for one such message; the exact time 
on any particular hardware,say, the connection machine, depends on n p . 

Communication time: In 3D we have 8 children and 189 as the maxi- 
mum size of an interactive set. Thus, the communication time is 16h + 26.72 
(equation 24). 

Computation time: To simplify the discussion it is assumed t + = i.; 
t/ = 5t+. Under these simplifying assumptions the time spent in computing 
is: 

1. $n°: 

n p (4m - l)t + ; 



31 



2. Translate z : 



3. Convert z : 

4. Shift,: 

5. Evaluation: Potential: 
Force: 



dkpt+] 

3c p t + ; 

4cp< + ; 

3mn p £+; 
6mn p f+. 



Denote by n„ the number of neighbors and by n c hudren the number of 
children. The total time is: 

T(^n )+h.T(Translate z )+(n n ).n chi i dren .T(Convert 2 )+h.T(Shift z )+T(Evaluation) 

(57) 
(n p (4m - 1) + 3/jfep + 3n n n cWWren Cp + 4hcj, + 3mn p )t+ (58) . 

(n p (7m - 1) + (3fc p + 4c p )/i + 3n B n cWUreB Cp)< + (59) 

For reference in the sequel it is convenient to denote 

P = n p (7m - l)t+ (60) 

which is the time spent in the leaves of the tree, and to denote 

Q = ((3Ap + Acp)h + 3n n n cWWren Cp)< + (61) 

which is the time spent in the tree nodes that are not leaves. 

Cells and particles: Let the number of cells in the bottom level,/*, be 
N = n x n x n ; then h = flog 2 n] . The total number of cells is: 

iV(1 + l + 6i + -> * 8 7 N (62) 

Since the number of processors in the connection machine is 6AK we 
would like to restrict the number of cells to this number. 
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The total number of particles is rtiN. 

Summary: The following table summarizes communication and compu- 
tation times for selected values of n, m and p. The computation time is 
stated in terms of t+. The communication time is stated in terms of message 
units. 

The time for Translate+ Convert + Shift is denoted by Q. This time is 
independent of m. In the table this time is separated from the time which 
depends on m and which represents operations done by the bottom level 
processes. 

n v com- 

munication 



n m particles h p 



Q $n° + evaluation 
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The above table is translated below to "real" numbers: the communica- 
tion time assumes a 190bit message and coefficients of 32bit; the computation 
time assumes 20Kfiops per second, typical of the CM2 model connection ma- 
chine [14]. The communication time is assumed to be 250 microseconds per 
message [15]. 
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faster then the BH algorithm. An accurate comparison requires com- 
paring performance for the same global error. Such an analysis has not 
as yet been done. 

4. Unlike the BH algorithm, the GR algorithm utilizes all the processors 
more or less equally. In considering the replacement of a connection 
network by fewer, faster processors we can take advantage of the fact 
that the bottom level is significantly busier than the rest of the com- 
puting elements; using the GR algorithm all the processors have to be 
accounted for and replaced (more on this subject in the sequel). 

7.3 Mapping the "pyramid" into a hypercube 

The mapping of the network of processes into an hypercube is preparation 
for executing the algorithm on a connection- machine- like computer. The 
interest here is twofold. How fast can a hypercube machine simulate the 
network and how can the n-body computation be organized on the hypercube. 
The definitions in this section follow [17]. The actual mapping is a rather 
straightforward generalization of [18]. 

An embedding < <j>,p > of a graph G = (V G ,E G ) into a graph H = 
(Vh,Eh) is defined by an injective mapping from V G to V#, together with 
a mapping p that maps (u,v) € E G onto a path p(<f>(u),<j>(v)) in H that 
connects <f>(u) and 4>{v). 

The quality of an embedding is measured by three cost functions — ex- 
pansion, dilation ,and load factor. The expansion of an embedding < (f>,p > 
of G into H is the ratio of the size of V H to the size of V G . The dilation of 
an edge (u,v) under < <j>,p > is the length of the path p(<t>{u),<f>(v)) in H. 
The dilation of an embedding < <j>, p > is the maximum edge dilation, over 
all edges in G, under < <f>,p> . The load factor A(e) of an edge e in H is the 
number of paths that pass through e which are images of edges in G. The 
load factor of an embedding is defined to be the maximum load-factor over 
all edges in H. 

Consider the 2D case first. An n x n pyramid graph is formed from meshes 
of size n x n, | x f , ^ x |, | x |, ..., 1 x 1 connected as implied by figure 
5. Note that n is a power of 2 and that if each node is one of our cells, 
this pyramid does not contain the connection to the "diagonal" neighbors. 
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(Omission of the diagonal link does not change d in either the two or three 
dimensional cases. As it is shown in the sequel, d is the important quantity 
in our case.) 





Figure 5: A 2 x 2 and a 4 x 4 pyramid 



Lemma 7.1 An n x n pyramid can be embedded in a 2n 2 node hypercube so 
that dilation of the embedding is 3 and the load-factor is 2. 

See [18] for a proof. 

Lemma 7.2 Annxnxn pyramid can be embedded in a 2n 3 node hypercube 
so that dilation of the embedding is 4 and the load-factor is 3. 

The lemma can be proved similarly to the proof in [18]. 

Clearly, we are interested in a mapping that simulates the tree well. Since 
each process is mapped to a different processor, the issue is an issue of com- 
munication time rather then computation time. Maps differ from each other 
in the amount of communication overhead they introduce as compared to 
the tree. While there is no claim that the above mapping is optimal, it is 



36 





Figure 6: A simple 2D embedding. 

attractive since neighboring nodes remain close together. Moreover, we can 
bound the number of cycles of the hypercube machine needed to simulate a 
cycle of the pyramid machine. 

[17] considers the case of simulating G by H where all nodes sends mes- 
sages at the same time on all adjacent wires. The resulting bounds are 
rnax{d, X} <T <d\ where T is the number of cycles of H required to sim- 
ulate any cycle of G. These bounds are too pessimistic for our case. In our 
case an upper bound is simply the dilation d. This is indeed not the bound 
on a general computation performed on a pyramid network but a bound on 
our own algorithm with the above mapping. The bound results from the fact 
that in this mapping the only overlapping paths are those from a node to 
its children. Since only one child at a time sends messages to its parent, no 
conflict arises. Since in most part of the algorithms the information consists 
of several messages that can be queued inside a process, it is believed that d 
is actually a conservative upper bound. 

The conclusion is that hypercube communication can be bounded by d 
multiplied by the communication performance of the pyramid. We believe 
that this is a rather conservative estimation of performance; with the pyramid 
mapped as above the average performance of the connection machine will be 
very close to that of the pyramid machine. 

The expansion story is less attractive. A 32 x 32 x 32 base (32K cells in 
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Figure 7: A simple 3D embedding. 

the bottom level) requires 64K processors, £ of them not participating in 
the computation. 

7.4 A Small Number of Larger Processors 

This section considers the implementation of the algorithm by a small num- 
ber of larger processors. Both "small number" and "larger" are with respect 
to the connection machine. We consider either 1 processor, or , for 3D arrays 
of 2 x 2 x 2, 4 x 4 x 4, etc. of processors. What we have in mind for a 
processor is a board with fast arithmetic chips, memory and a controller. 
Currently, a board like this can deliver about 20M floating point operations 
per second and has a memory of several million words of 32 bits each. The in- 
terconnection among neighboring processors is via 32-bit buses. To facilitate 
simple programming and symmetric information transfer between processors 
the buses are interconnected as illustrated by figure 8. The processors are 
associated with grid points. Each processor has an input bus and an output 
bus. The input bus of a processor is connected (via tri-state gates) to the 
output buses of all its neighbors. The output bus of a processor is connected 
(via tri-state gates) to the input buses of all its neighbors. 
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Figure 8: Interconnection among neighboring processors. 

The pyramid of processes has to be partitioned and assigned to individual 
processors. It is assumed that this partition is symmetric; The base of the 
pyramid is equally divided among the processors, each getting a smaller 
pyramid; whatever is left is divided again, or assigned to one processor. For 
example, a 32 x 32 base pyramid of processors is divided among 4 processors; 
Each gets a 16 x 16 base pyramid; the top process remains and has to be 
assigned to one of the processors. 

There are several differences between this computation mode and the ones 
discussed before. First, no messages have to be sent between two processes 
that resides on the same processors (local processes). (A process accesses the 
results of another process directly.) Thus, in this case, the communication 
time is negligible. Second, a processor evaluates its processes sequentially; 
parallelism exists between processors only. Thus, the evaluation order has to 
be such that no processor will be waiting for data from another processor. 

The transfer of information between processors is different. It is assumed 
that each processor has a DMA channel so that data transfer to and from 
buses can be done simultaneously with computing. In the pyramid case, to 
save "wires", information was routed to interactive set members through par- 
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ents. Clearly, this does not make any sense for local processes; for processes 
in different processors the issue is how not to send the same information more 
than is necessary. Therefore, when a cell has a result which is needed by a 
cell in another processor, the result is sent over; the other processor has a 
memory cell for each cell on other processor that is in one of its own cells 
interactive sets. The arriving information is stored there to be used by all 
cells that need it. 

Let us consider a pyramid of processes with a base of n x n x n (e.g., 
n = 32) and an array (of larger processors) of size kxkxk, where k = 2,4, 8, 
etc. 

The pyramid of processes is partitioned among the array's processors so 
that each processor gets a pyramid of base £ x £ x £. The top of the original 
pyramid is not covered; In the case of base 32 pyramid, which is used here 
for comparison, and arrays of base 2, 4, and 8, the top is a pyramid with 
base 1x1x1,2x2x2, and 4x4x4, respectively. It is assumed that this 
tip is either assigned to one of the processors or divide among them. For the 
cases k = 2,4,8 the contribution of the tip pyramid to the total time can be 
ignored and this is what is done in the sequel. 

It is assumed that the systems of processors uses the same algorithm 
as in section 4, except that each processor services all processes in its own 
pyramid. The processors work in parallel and in synchronization so that no 
processor has to wait for the results of its fellow processors. 

It is not difficult to see that each processor implements somewhat less 
than if(^) processes. Similarly, a processor has to implement the communi- 
cation between the processes; the time for communication between processors 
residing on the same processor can be ignored. The communication time be- 
tween processors has to be accounted for. This communication is represented 
by "wires" to from processes on the pyramid faces to neighboring processes 
which reside on neighboring processors. In 3D the number of these wires is 
proportional to (£) and the sequel considers their effect on the communica- 
tion time. 

The calculation of the time computation is based on 20M floating point 
operations per seconds and the communication time is based on transfer time 
of 80 nanoseconds per 32 bits. 

Communication time: It is assumed that each processor has pseudo- 
cells , or p-cells, one p-cell for each cell z, such that z is a member of an 
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interactive set of a cell in the processor's pyramid, z not in the pyramid. 
The number of such cells can be calculated. For a cell y, y a cell on the 
pyramid face, these are the children(neighbors(parent(y))) which are not in 
the pyramid. (Note that parent(y) is in the pyramid and that we are inter- 
ested in neighbors(parent(y)) which are not in the pyramid.) If we "cover" 
the pyramid faces with such cells, this cover includes all the interactive set 
members of cells in the pyramid where the members are not in the pyramid. 
The number of cells in this cover is: 

(number of children)x(number of cells in the faces of a — x — x— pyramid) 

LiK £t\t £tfc 

(63) 
The number of cells in a face ofanxnxn pyramid is 



"^ + I + 17 + -) < o" 2 (64) 



11 x 4 

4 + 16+-X3 
Thus, the number of cells in the above cover becomes 



AA 



n 



8(3(f) 2 x6 + 81og(-)) (65) 

The last term are cells that cover the corners. This last term can be just 
ignored. 

Thus, the number of p-cells is about 16(|) 2 and this is the number of 
messages the processor has to send or to receive (ignoring the fact that some 
processors do not have neighbors on all directions). 

The GR algorithm defines the content of such a cell by n p + 3 coefficients 
(the 3 represents the coordinates of the source). Thus, the communication 
time is 

(n p + 3)x32xl6(^) 2 ( 80X 32 10 " 9 ) (66) 

Similarly, the BH algorithm defines the content of such a cell by 1 + 
3 coefficients (the 3 represents the coordinates of the source). Thus, the 
communication time is 

4 x 32 x <-)\^l) (67) 

Computation time: A processor that implement a J x | x | base 
pyramid has to perform all the operations that its processes do. 
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For the 3D GR algorithm the computation time is: 

n 3 8 8 8 

(-) (T($n°)+-T(Translate 2 )+-n set T(Convert z )+-T(Shift z )+T(Evaluation)) 

(68) 
Using equations (39), (43), (46), (49) and (50), the expression becomes 



n 3 8 8 

(-) (n p (4m - 1) + 3-& p + 3n 3etCp + 4-Cp + 3mn p )t+ 



which results in 



(-) (n p (7m - 1) + -(3fc p + (3n, et + 4)cp))t + . 



(69) 



(70) 



The computation time of the 3D BH algorithm almost entirely consists 
of the evaluation time. Thus, for a £ x £ x £ base pyramid the computation 
time is 

n 3 i 
(-r) (evaluation time for one processor of the pyramid) 

GR algorithm summary: The following table summarizes the compu- 
tation and communication time for the GR algorithm and for various values 
of n, p and A;. 



n m particles p 



n. 



32 20 

32 20 

32 20 

38 20 

38 20 

38 20 



640K 

640K 

640K 

1250K 

1250K 

1250K 



1 
2 
3 
1 
2 
3 



4 
10 
20 84 



7 7 
28 25 



4 
10 



65 

7 7 
28 25 



20 84 65 



k=2 
coram- comp- 
time time 



sec 



2.3 

4.27 

7.53 

3.25 

6.0 

10.6 



1.05 

3.6 

9.3 

1.76 

6.1 

15.6 



k=4 
comm- comp- 
time time 

sec 
0.57 0.132 
1.07 0.46 
1.87 1.16 
0.8 0.22 

1.51 0.764 
2.63 1.95 



k=8 
comm- comp- 
time time 



sec 



0.15 
0.27 
0.47 
0.21 
0.37 
0.67 



0.016 
0.057 
0.146 
0.027 
0.095 
0.244 



The above numerical results can be compared with the the performance of 
the GR algorithm on the pyramid (section 7.2). The following table compares 
the performance for p = 3. Define Ratio 



Ratio = 



(total time pyramid) 
(total time k x k x k array) ' 



(73) 
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n 


m 


^articles p 


Ratio 


Ratio 


Ratio 








k = 2 


k = 4 


k = 8 


32 


20 


640K 3 


0.32 


1.75 


8.61 


38 


20 


1250# 3 


0.20 


1.16 


5.23 



BH algorithm summary: The following table summarizes the compu- 
tation and communication time for the BH algorithm. 



n m particles 



32 

38 



20 
20 



640K 
1250K 



k=2 

comm- comp- 

time time 

sec 

1.3 126.4 

1.84 179.2 



k=4 
comm- comp- 
time time 

sec 
0.33 16.0 
0.46 22.4 



k=8 
comm- comp- 
time time 

sec 
0.08 2.0 

0.11 2.4 



The above numerical results can be compared with the the performance 
of the BH algorithm on the pyramid (section 7.1). 



n m particles 



32 

38 



20 
20 



640K 
1250K 



Ratio 
k=2 
0.15 
0.14 



Ratio 


Ratio 


k=4 


k=8 


1.21 


9.8 


1.25 


9.12 



Summary: It follows from the previous section that 

computation time pyramid + communication time pyramid < 

total time hypercube < (74) 

computation time pyramid + dx (communication time pyramid). 

where d = 4. Bounds for the ratio of the total time, hypercube, to the total 
time, array, can be derived from this relation. Since I believe that d = 4 
is a rather conservative bound, I am giving the hypercube the benefit of 
the doubt and, for comparison sake, consider its performance equal to the 
pyramid. 

The above numerical results indicates that the array can comfortably 
compete with either the pyramid or the hypercube (both with the CM2 
connection machine parameters). A general comparison of the array and 
the hypercube has to weight two additional factors: First, the hypercube, 
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8 Near-field Computation 

This section estimates the computation and communication time required 
for near-field computation. These estimates are added to the estimates of 
the far-field computation to yield an estimate for the total time required. 

Let q be a point mass at some cell n. The near-field force on q is the force 
exerted on q by particles in n an in neighbor s(n). This force is computed 
using 

_ M x M y Xi- yi 

**< ~ U T\ ^2ii i7> ( 75 ) 

II x - y \y \\x - y || 

where M x and M y are the two mass points in 3D whose coordinates are 
x = (xi,x 2 ,x 3 ) and y = (3/1,3/2,2/3), respectively, and F Xi is the force on M x 
in the x,- direction. This calculation is rather straight forward and the only 
issue is how to take advantage of the symmetry of the force (F Xi = -F Xi ) 
which, in parallel computation, requires some elaboration. 

Consider the pyramid computational structure. Only the bottom layer 
processes participate in the near-field computation. The following is a step 
by step description of this computation. 

• For all cells c in the bottom layer evaluate the forces on each mass in 
c due to the other mass points in c. 

The evaluation of equation 75 for all force components requires 2t+ + 
6t_ +23t, +t/ = 36f+. There are |m(m — 1) such computations; Under 
the assumptions stated in section 6 the total time is ym(ra — l)t + . 

• Consider a cell c and its neighbors, c and each of its neighbors, say e, 
determine a direction. This direction is determined by x e — x c where 
x e is the center of the neighbor cell and x c is the center of c. Clearly, 
if x e — x c is a direction, so is x c — x e . Let D be a set of directions such 
that if d is a direction the either d € D or — d € D but not both. 

For all directions d in D do 

— Each bottom layer cell c sends the value and coordinates of each 
of its mass points to its neighbor in the d direction. 
Communication time is m message unites. 
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- Each cell c provides 3m locations for the forces on the mass points 
whose values and coordinates it just received. Next it calculates 
the forces between any two pairs of mass points, one in c and one 
that its values just arrived from the neighbor. 
Computation time is 36m(m — l)tf + 

- The values accumulated in the 3m locations are sent back to the 
neighbor cell from which the mass points came. 
Communication time is m message units. 

Since the number of directions in D is *f the total time is 

36 n 

(ym(m - l)t+ + -f(Z6m(m - l)t+ + 2mt mewge uni< ) (76) 

Since the communication time is small as compared to the computation 
time we (almost) get the 1/2 factor discussed in the beginning of this section. 
This is achieved by having one processor calculate the force and send the 
results back to its neighbor. 

Using the CM2 connection machine parameters once again (20k floating 
point operations per second, 250 microseconds per message) the time for 
evaluating the near-field by the pyramid is given below (n„ is 26 for 3D), 
m communication time computation time 
n n m 36(^ + l)m(m - l)t+ 

msec msec 

10 65 2,268 

20 130 9,576 

The array of larger processors is faring just as well in the near-field com- 
putation. As seen from the above, communication time is negligible. Since 
only the bottom layer processors of the pyramid participate in the calculation 
we get for an n x n x n base pyramid 



pyramid time 1 



k x k x k processor array time (f) 3 , ^ 



(77) 



where 



, t + for a pyramid processor „ v 

speedup = ——- *-f £ (78) 

t+ for a larger processor 
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9 Summary and Conclusions 

The work formulates the N-body problem as a set of recursive equations 
with boundary conditions. The numerical solution to this set of equations 
can be viewed as a pyramid structure of processes. There are three kinds of 
processes: the bottom level processes, the top level processes and all the rest. 
All processors are implemented by means of five functions: $°, Translate, 
Convert,, Shift,, Evaluate. Both the BH and the GR algorithms have 
the same structure. They differ from each other in the above functions. 
This formulation of the N-body problem simplifies the understanding of the 
algorithms, their analysis and programming. 

The pyramid structure is mapped into three different architectures that 
can be viewed as possible hardware implementations. For each mapping the 
amount of time for arithmetic computations and the time for communication 
between processors is estimated. Out of these estimates several qualitative 
conclusions are derived: 

Parallel execution reduces both the BH and the GR algorithms to 0(log N), 
where N is the number of particles; this bound covers both computation and 
communication time. This is somewhat of a surprise, because in the serial 
computation the BH algorithm is considered a 0(N log JV) algorithm while 
the GR one is O(N). 

The algorithms exhibit both common and differing patterns of behavior. 
In both cases, given the the height of the tree h, the time spent in the tree 
is independent of m, the number of particles in a cell. The evaluation stage, 
however, is hard on BH; for large m most of the time is spent in it, and 
computation time far exceeds communication time. For large m the GR 
algorithm performs exceedingly well; the evaluation of the potential function 
for each particle is straightforward. Clearly, using a 38 x 38 x 38 pyramid, 
the GR algorithm can handle 10M equally distributed particles almost as 
fast as it handles 1M. At these numbers, the increase of m from 20 to 200 
increases the total time by about 25 percent only because the algorithm is 
bounded by the communication time rather than by the evaluation time at 
the bottom cell. 

Since the BH algorithm is much simpler to program it is worthwhile noting 
that its performance for 62K particles, one at a cell, is about equal to that 
of the GR algorithm when p = 2, using the same number of particles. A 
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complete comparison of BH and GR requires compaxing the algorithms for 
the same global error. This work has not been completed as yet. 

A hypercube machine can simulate the pyramid structure without conflict 
of messages and with a slowdown equal to or less than d = 4. 

Using the GR algorithm with p = 3 and a connection machine of the 
present size, it is possible to calculate a solution to the 1M particle (static) 
many body problem in about 24 seconds. This number is based on multiply- 
ing d = 4 by the communication time computed for the tree and adding the 
computation times for the far- field and for the near-field. It is considered a 
rather conservative estimate. This number is also the estimate for one time 
step for the dynamic many body problem. 

For running the N-body algorithm the connection machine has a clear 
advantage: it is an existing product that can be readily used. The analysis, 
however, points out some possible disadvantages; while 64K processors seem 
quite a large number, the 3D N-body problem reaches the limit rather fast. 
One has to increase m , assigning more particles per cell. Clearly, at least for 
the BH case, we would rather increase the number of processors; a feature 
the hardware does not support. 

Rough performance estimates for an array of faster processors indicate 
that with a modest size array it is possible to get performance competitive 
to that of the connection machine. 

The above calculations considered particles that are equally distributed 
in space. Often, physical system exhibit sparsity, certain regions of space 
have many particles while others are empty or nearly so. Taking advantage 
of sparsity awaits farther work. 

The computation and the communication times computed should be con- 
sidered initial estimates as data moving operations were not taken into ac- 
count. Moreover, these estimates await experimental verification, preferably 
by programs simulating the above processes. This work is currently under- 
way. 
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